Abstract. Covariate balance is a conventional key diagnostic for methods
that estimate causal effects from observational studies. We show that covariate
balance can also be used directly in the estimation of mean causal effects, by
studying a recently proposed entropy maximization method called Entropy
Balancing (EB). To avoid the tautology of repeatedly checking covariate
balance, EB exactly matches the covariate moments of the different
experimental groups in its optimization problem. We find that the primal and
dual problems of EB correspond, respectively, to an implicit linear model of
the outcome and a logistic regression model of the propensity score, with the
balanced moments as features. Consequently, we prove that EB enjoys some
desirable statistical properties: it is doubly robust with respect to the
linear models and reaches the asymptotic semiparametric variance bound when
both models are correct. Our theoretical results and simulations suggest that
EB is a very appealing alternative to the conventional propensity score
weighting estimators.


1. Introduction 

Consider an observational study in which two conditions (denoted “treatment”
and “control”) are not randomly assigned. Such observational data often
serve as the basis for biomedical, economic and social research, where the
researchers wish to study causality between a treatment and some response but
cannot or do not run a fully randomized experiment. Drawing a causal conclusion
from observational data is inherently difficult because treatment exposure may
be related to some attributes that are also related to the outcome, so the
treatment and control groups may be seriously imbalanced in these attributes.
If the researcher simply ignores this imbalance, the causal effect estimator
could be severely biased.
ENTROPY BALANCING


To adjust for covariate imbalance, the seminal work of Rosenbaum and Rubin
(1983) pointed out the essential role of the propensity score, the probability
of exposure to treatment conditional on observed covariates. Many statistical
methods based on the propensity score have subsequently been proposed to
estimate the mean causal effect. A popular approach is to stratify the
observations based on their estimated propensity scores (Rosenbaum and Rubin,
1984). The mean causal effect is then estimated by the weighted average of
within-stratum effects. An alternative method (e.g. Hirano and Imbens, 2001)
is to construct weights for individual observations by using their estimated
propensity scores.

While propensity score stratification is usually more robust to misspecified
models, propensity score weighting has more attractive theoretical properties.
Hirano et al. (2003) showed that nonparametric propensity score weighting can
achieve the semiparametric efficiency bound for the estimation of the mean
causal effect derived by Hahn (1998). Another desirable property is double
robustness. The pioneering work of Robins et al. (1994) pointed out that
propensity score weighting can be further augmented by an outcome regression
model. The resulting estimator has the following so-called double robustness
property:

Property 1. If either the propensity score model or the outcome regression model 
is correctly specified, the mean causal effect estimator is statistically consistent. 

However, propensity score methods commonly suffer from a serious drawback in
practice. Often, empirical covariate imbalance exists between the treatment and
control groups, and propensity score models often fail to correct for this
imbalance. As a result, the estimator of the causal effect may be severely
biased (Drake, 1993; Smith and Todd, 2005; Kang and Schafer, 2007). Model
misspecification is the most likely root cause, as the true propensity score
always stochastically balances the associated covariates. To avoid model
misspecification, applied researchers usually increase the complexity of the
propensity score model until a sufficiently balanced solution is found. This
cyclical process of modeling the propensity score and checking covariate
balance is sometimes criticized as the “propensity score tautology” (Imai et
al., 2008) and, moreover, has no guarantee of ever finding a balanced solution.

To avoid repeatedly checking the covariate balance, an alternative is the
multivariate matching method based on Mahalanobis distance, which has appealing
theoretical properties if covariates have ellipsoidal distributions (Rubin,
1976a,b). However, such assumptions are rarely valid with actual data, so the
multivariate matching method may actually make covariate balance worse (Sekhon,
2011). Moreover, matching procedures often discard a large portion of the data
and thus lose efficiency.

Recently, some new approaches have been proposed to incorporate covariate
balance directly in the procedure, so covariate balance is satisfied
automatically. For instance, Diamond and Sekhon (2013) devised a matching
algorithm called Genetic Matching that maximizes the balance of observed
covariates between treated and control groups. Imai and Ratkovic (2014)
proposed to include covariate balance in the maximum likelihood estimation of
the propensity score logit model.

In this paper, we study another method for covariate balance called Entropy
Balancing (hereafter EB), proposed by Hainmueller (2011). EB is easily
interpretable and fast to compute, and has already gained some popularity in
applied fields such as applied econometrics and the social sciences (Marcus,
2013; Ferwerda, 2014). In a nutshell, EB solves an entropy maximization problem
under the constraint of exact balance of covariate moments. It puts weights on
the control units and can be viewed as an extension of matching methods, which
put equal weights on matched observations.

We show that Entropy Balancing is indeed a very attractive alternative to the
conventional propensity score weighting methods. We find that EB simultaneously
fits a logistic regression model for the propensity score and a linear
regression model for the outcome. The features of these regression models are
the covariate moments being balanced. In Theorem 1 we show that EB has a double
robustness property: if at least one of the two models is correctly specified,
it is consistent for the Population Average Treatment effect on the Treated
(PATT), a common quantity of interest in causal inference and survey sampling.
Moreover, EB is semiparametrically efficient if both models are correctly
specified. The two (generalized) linear models have an exact correspondence to
the primal and dual optimization problems used to solve EB, revealing an
interesting connection between doubly robust estimation and convex
optimization.

Since EB does not have any explicit model and the double robustness property we
prove is with respect to the simplest (generalized) linear models, we say EB
has a “minimal” double robustness property. Furthermore, the minimality of EB
reveals an interesting new relationship between double robustness and covariate
balance. Traditionally, one may view the propensity score model and the outcome
regression model as contributing independently to the robustness of the causal
effect estimator. That is why Robins et al. (1994) named the property “double
robustness”. Our study of EB reveals that at least certain forms of the two
models are equivalent to each other. The two models essentially produce
balanced groups, which reduces the bias of the causal effect estimator. Our
understanding of this relationship between covariate balance and double
robustness is shown in Figure 1 and is discussed more thoroughly in Section 7.



Figure 1. The central role of covariate balance in observational
study. Dashed arrows: conventional understanding of double
robustness. Solid arrows: our understanding of double robustness
via entropy balancing.


The rest of the paper is organized as follows. In Section 2 we give a brief
review of causal inference and doubly robust estimators. In Section 3 we
introduce Entropy Balancing and in Section 4 we discuss the theoretical
properties of EB. The empirical performance of EB is studied in Section 5 by
simulations. Finally, in Section 6 we discuss some extensions of EB and in
Section 7 we discuss how propensity scoring and outcome regression are linked
by covariate balance.
















2. Background 


First, we review some building blocks of the causal model for non-randomized
studies introduced by Rubin (1974) and the framework of doubly robust
estimators. Many of the concepts introduced here are deeply connected to our
analysis of EB in the following sections.

2.1. Rubin’s Causal Model. The causal inference problem we consider is best
explained by the notion of potential outcomes. In Rubin’s causal model, each
unit $i$ is associated with a pair of potential outcomes: the response $Y_i(1)$
that is realized if $T_i = 1$ (treated), and another response $Y_i(0)$ realized
if $T_i = 0$ (control). Notice that it is impossible to observe both $Y(0)$ and
$Y(1)$ on the same unit. This is often referred to as the “fundamental problem
of causal inference” (Holland, 1986).


We assume the observational units are independent and identically distributed 
copies from a population, for which we wish to infer the treatment’s effect. In this 
paper we only consider continuous outcome and focus on the Population Average 
Treatment effect on the Treated (PATT): 


$$
\gamma = \mathrm{E}[Y(1) \mid T = 1] - \mathrm{E}[Y(0) \mid T = 1] \triangleq \mu(1|1) - \mu(0|1). \tag{1}
$$


This quantity also occurs in survey sampling with missing data (Cassel et al.,
1976; Kang and Schafer, 2007). The methods that we consider in this paper also
naturally extend to survey sampling problems.

Another common quantity considered in causal inference is the Population
Average Treatment Effect (PATE)

$$
\tau = \mathrm{E}[Y(1)] - \mathrm{E}[Y(0)] \triangleq \mu(1) - \mu(0),
$$

which is related to PATT through the identity

$$
\mu(0) = \mu(0|0) P(T = 0) + \mu(0|1) P(T = 1).
$$

The fundamental problem of causal inference is to estimate the “counterfactual”
means $\mu(0|1)$ and $\mu(1|0)$. In this paper we focus on the estimation of
$\mu(0|1)$.

Along with the treatment exposure $T_i$ and outcome $Y_i$, each experimental
unit $i$ is usually associated with a set of covariates denoted by $X_i$. In
observational studies, both treatment assignment and outcome can be related to
the covariates, which may cause serious selection bias. However, as highlighted
by the seminal work of Rosenbaum and Rubin (1983), it is possible to correct
this selection bias under the following two assumptions:

Assumption 1 (strong ignorability). $(Y(0), Y(1)) \perp T \mid X$.

Assumption 2 (overlap). $0 < P(T = 1 \mid X) < 1$.

Intuitively, the first assumption implies that the observed covariates contain
all the information that may cause the selection bias, i.e. there is no
unmeasured confounding variable, and the second assumption ensures this
bias-correction information is present across the entire domain of X. We will
see these two assumptions play a foundational role in the doubly robust
estimators (Section 2.2) and the EB method (Section 3 and Section 4.2).

Since the covariates $X$ contain all the confounding information of selection
bias, it is important to understand the relationship between $T$, $Y$ and $X$.
Under Assumption 1 (strong ignorability), the joint distribution of
$(X, Y, T)$ is determined by the marginal distribution of $X$ and two
conditional distributions given $X$. The first conditional distribution
$e(X) = P(T = 1 \mid X)$ is often called the propensity score and plays a
central role in causal inference. Rosenbaum and Rubin (1983) proved that under
Assumption 1 (strong ignorability) and Assumption 2 (overlap),
$(Y(0), Y(1)) \perp T \mid e(X)$, i.e. the propensity score function $e(X)$
itself is sufficient to produce unbiased estimates of the causal effect. The
second conditional distribution is the density of $Y(0)$ and $Y(1)$ given $X$.
Since we only consider the mean causal effect in this paper, it suffices to
study the regression functions $\mathrm{E}[Y(0) \mid X]$ and
$\mathrm{E}[Y(1) \mid X]$, denoted by $g_0(X)$ and $g_1(X)$.


2.2. Doubly Robust Estimators. Many existing estimators of the mean causal
effect are special cases of the doubly robust estimation framework (Robins et
al., 1994). We refer readers to the subsequent literature (e.g. Robins and
Wang, 2000; Bang and Robins, 2005) and the review article by Kang and Schafer
(2007) for more details about doubly robust estimators. When using doubly
robust estimators, we first need to build individual models for both the
propensity score $e(X)$ and the outcome $(g_0(X), g_1(X))$, then combine the
two models in a doubly robust estimator. We review this procedure in this
section.

The first component of doubly robust estimators is called inverse-probability
weighting (IPW). Suppose we have an estimate of $e(X)$, denoted by
$\hat e(X)$. For simplicity, we write $\hat e(X_i)$ as $\hat e_i$. The
(weight-normalized) IPW estimators are:


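$$
\hat\gamma^{\mathrm{IPW}} = \sum_{T_i = 1} \frac{1}{n_1} Y_i - \sum_{T_i = 0} \frac{\hat e_i (1 - \hat e_i)^{-1}}{\sum_{T_j = 0} \hat e_j (1 - \hat e_j)^{-1}} Y_i. \tag{2}
$$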


Here $\sum_{T_i = t}$ denotes summation over all units $i$ such that
$T_i = t$. This notation will be used repeatedly throughout the paper.

In this formula, the control units are associated with weights proportional to
$\hat e_i (1 - \hat e_i)^{-1}$ so that they resemble the treated population.
The most popular way of obtaining $\hat e(X)$ is via logistic regression, where
$\mathrm{logit}(e(X))$ is modeled by $\sum_{j=1}^p \theta_j c_j(X)$ and the
$c_j(X)$ are functions of the covariates.
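
The weight-normalized IPW estimate of the PATT can be sketched in a few lines.
This is a minimal illustration that assumes the propensity scores have already
been estimated; the function name `ipw_patt` and its argument names are
hypothetical and not part of the paper.

```python
import numpy as np

def ipw_patt(y, t, e_hat):
    """Weight-normalized IPW estimate of the PATT (illustrative sketch).

    y: observed outcomes, t: 0/1 treatment indicators,
    e_hat: estimated propensity scores (assumed to lie strictly in (0, 1)).
    """
    y, t, e_hat = map(np.asarray, (y, t, e_hat))
    treated, control = t == 1, t == 0
    # Odds e_i / (1 - e_i) for the control units.
    odds = e_hat[control] / (1.0 - e_hat[control])
    # Normalize so the control weights sum to one.
    w = odds / odds.sum()
    # Treated mean minus the weighted control mean.
    return y[treated].mean() - np.sum(w * y[control])
```

With constant propensity scores the control weights are uniform and the
estimate reduces to the difference in group means.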

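The second component of doubly robust estimators is an outcome regression
model, which estimates the mean potential outcomes $g_t(X)$ by
$\hat g_t(X)$, $t = 0, 1$. A common choice of $\hat g_t$ is the linear
regression model $\hat g_t(X) = \sum_{j=1}^p \hat\beta_{t,j} c_j(X)$, which
leads to the OLS estimators:

$$
\hat\gamma^{\mathrm{OLS}} = \frac{1}{n_1} \sum_{T_i = 1} Y_i - \frac{1}{n_1} \sum_{T_i = 1} \hat g_0(X_i). \tag{3}
$$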


This approach is also called “covariance adjustment” in the literature
(Rosenbaum and Rubin, 1983; Rosenbaum et al., 2002).

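The third and last component, a doubly robust semiparametric estimator,
combines the first two models using residual bias correction. The following
example was first proposed by Cassel et al. (1976) and later by Robins et al.
(1994):

$$
\hat\gamma^{\mathrm{DR}} = \hat\gamma^{\mathrm{OLS}} - \sum_{T_i = 0} \frac{\hat e_i (1 - \hat e_i)^{-1}}{\sum_{T_j = 0} \hat e_j (1 - \hat e_j)^{-1}} \left( Y_i - \hat g_0(X_i) \right). \tag{4}
$$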


$\hat\gamma^{\mathrm{DR}}$ can be viewed as a correction of
$\hat\gamma^{\mathrm{OLS}}$. The bias correction term replaces $Y_i$ in the IPW
estimators with the regression residual. The estimator
$\hat\gamma^{\mathrm{DR}}$ has the desirable double protection we mentioned
earlier and the following efficiency property.

Property 2. If both the propensity score and outcome regression models are
correctly specified, the estimator asymptotically achieves the semiparametric
variance bound derived in Hahn (1998).


Proposition 1. The estimator $\hat\gamma^{\mathrm{DR}}$ satisfies Properties 1
and 2.


Proof. Here we only prove the double robustness property and refer the readers
to Rothe and Firpo (2013) for a proof of efficiency.

In the case that the outcome model is correctly specified (i.e.
$\hat g_0(X) \to g_0(X)$ as $n \to \infty$), the OLS estimator
$\hat\gamma^{\mathrm{OLS}}$ is already consistent, and the bias correction term
in (4) has mean going to 0 as $n \to \infty$. In the case where the outcome
model is misspecified but the propensity scores are correctly specified, the
additional term in (4) is a consistent estimator of the bias of the OLS
estimator. □
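
The residual bias correction can be illustrated with a short sketch. It assumes
the augmented PATT estimator takes the form in which the fitted control-outcome
model is subtracted from every observed outcome before the IPW weights are
applied; the function name `dr_patt` and its arguments are hypothetical, and
both nuisance estimates are taken as given inputs.

```python
import numpy as np

def dr_patt(y, t, e_hat, g0_hat):
    """Residual-corrected (doubly robust style) PATT estimate, as a sketch.

    e_hat:  estimated propensity scores, assumed strictly inside (0, 1).
    g0_hat: fitted values of the control-outcome model for every unit.
    """
    y, t, e_hat, g0_hat = map(np.asarray, (y, t, e_hat, g0_hat))
    treated, control = t == 1, t == 0
    resid = y - g0_hat                       # regression residuals
    odds = e_hat[control] / (1.0 - e_hat[control])
    w = odds / odds.sum()                    # normalized control weights
    # Outcome-model estimate on the treated, minus weighted control residuals.
    return resid[treated].mean() - np.sum(w * resid[control])
```

Setting `g0_hat` to zero recovers the plain weighted estimate, and a
control-outcome model that fits the controls exactly removes the correction
term altogether.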

Since all models are vulnerable to misspecification, the additional protection
DR estimators provide can give applied researchers much more confidence in
their causal effect estimates. However, Kang and Schafer (2007) point out that
the standard DR estimator could behave poorly when both models are
misspecified. Many subsequent doubly robust estimators have been proposed to
further improve the performance under misspecified models; see Tan (2006); Cao
et al. (2009); Tan (2010); Rotnitzky et al. (2012).

The doubly robust property can also be satisfied through weighted regression.
We refer the reader to Kang and Schafer (2007) and Freedman and Berk (2008) for
more details. In Section 4.3 we will construct a doubly robust weighted least
squares estimator of $\gamma$ using Entropy Balancing.

3. Entropy Balancing 


In this section, we review the definition of Entropy Balancing (Hainmueller,
2011). We also expose connections of the Entropy Balancing optimization problem
with propensity scoring and outcome regression.

The EB procedure generates a set of weights for all the control units that are
used to estimate the PATT in equation (1). EB operates by maximizing the
entropy of the weights under a set of balancing constraints:

$$
\begin{aligned}
\underset{w}{\text{maximize}} \quad & -\sum_{T_i = 0} w_i \log w_i \\
\text{subject to} \quad & \sum_{T_i = 0} w_i c_j(X_i) = \bar c_j(1), \quad j = 1, \dots, p, \\
& \sum_{T_i = 0} w_i = 1, \\
& w_i > 0, \quad i = 1, \dots, n.
\end{aligned}
\tag{5}
$$


Hainmueller (2011) refers to the set $\{c_j(\cdot)\}_{j=1}^p$ as moment
functions of the covariates. These can be essentially any transformation of
$X$, not necessarily polynomial functions. In Section 4 we show that they
indeed serve the same purposes as linear features in the propensity score model
or outcome model. In order to weight the control units to resemble the
treatment population, the balancing targets $\bar c_j(1)$, $j = 1, \dots, p$,
are the empirical “moments” of the treatment population,
$\bar c_j(1) = \frac{1}{n_1} \sum_{T_i = 1} c_j(X_i)$. Throughout the paper we
will use $c(X)$ and $\bar c(1)$ to stand for the vectors of $c_j(X)$ and
$\bar c_j(1)$, $j = 1, \dots, p$.

Since EB seeks to empirically match the control and treatment covariate
distributions, we can draw connections between EB and density estimation. This
connection allows us to expose the relationships between EB and propensity
score modeling. In particular, the EB optimization problem (5) is a finite
sample version of the minimum relative entropy principle in density estimation.
Letting $m(x)$ be the density function of the covariates $X$ for the control
population, the minimum relative entropy principle estimates the density of the
treatment population as

$$
\underset{\tilde m}{\text{minimize}} \quad H(\tilde m \,\|\, m) \quad \text{subject to} \quad \mathrm{E}_{\tilde m}[c(X)] = \bar c(1). \tag{6}
$$

Here, $H(\tilde m \,\|\, m) = \mathrm{E}_{\tilde m}[\log(\tilde m(X)/m(X))]$ is
the relative entropy between the distribution $\tilde m$ and the distribution
$m$. As an estimate of the distribution of the treatment group, the optimal
$\tilde m$ of (6) is the “closest” to the control distribution among all
distributions satisfying the moment constraints.

Let $w(x) = [P(T = 1) \cdot \tilde m(x)] / [P(T = 0) \cdot m(x)]$ be the
population version of the inverse probability weights in (2). Applying a change
of measure, we can rewrite (6) as an optimization problem over $w$:

$$
\underset{w}{\text{maximize}} \quad -\mathrm{E}_m[w(X) \log w(X)] \quad \text{subject to} \quad \mathrm{E}_m[w(X) c(X)] = \bar c(1). \tag{7}
$$

The EB optimization (5) is a finite sample version of this problem, where the
distribution $m$ is replaced by the empirical distribution of the control
units.

Next, we show the above heuristic allows us to view EB as a propensity score
model. Using Lagrange multipliers, we can show that the solution to (6) belongs
to the family of exponentially tilted distributions of $m$ (Cover and Thomas,
2012):

$$
m_\theta(x) = m(x) \exp\left(\theta^T c(x) - \psi(\theta)\right).
$$

Here, $\psi(\theta)$ is the cumulant generating function (the logarithm of the
moment generating function) of this exponential family. Consequently, the
solution of (7) is

$$
\frac{P(T = 1 \mid X = x)}{P(T = 0 \mid X = x)} = w(x) = \exp\left(\alpha + \theta^T c(x)\right), \tag{8}
$$

where $\alpha = \log(P(T = 1)/P(T = 0))$. Equation (8) is exactly the logistic
regression model of the propensity score using $c(x)$ as features. This implies
EB implicitly fits a logistic regression model.

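However, the EB estimate differs from the maximum likelihood estimate of
logistic regression. The dual optimization problem of (5) is

$$
\underset{\theta}{\text{minimize}} \quad \log\left( \sum_{T_i = 0} \exp\left( \sum_{j=1}^p \theta_j c_j(X_i) \right) \right) - \sum_{j=1}^p \theta_j \bar c_j(1), \tag{9}
$$

which is different from the maximum likelihood problem

$$
\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^n \log\left( 1 + \exp\left( -(2T_i - 1) \sum_{j=1}^p \theta_j c_j(X_i) \right) \right). \tag{10}
$$

The loss functions in (9) and (10) are different. Further, (9) uses only the
control units individually; the treated units enter only through the moments
$\bar c(1)$.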
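The unique solution $\hat\theta^{\mathrm{EB}}$ to (9) can be easily computed by
Newton's method, and the EB weights (the solution to the primal problem (5))
are immediately given by the Karush-Kuhn-Tucker conditions

$$
w_i^{\mathrm{EB}} = \frac{\exp\left( \sum_{j=1}^p \hat\theta_j^{\mathrm{EB}} c_j(X_i) \right)}{\sum_{T_k = 0} \exp\left( \sum_{j=1}^p \hat\theta_j^{\mathrm{EB}} c_j(X_k) \right)}, \quad \forall\, T_i = 0. \tag{11}
$$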

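Hainmueller (2011) proposes to use the weighted average
$\sum_{T_i = 0} w_i^{\mathrm{EB}} Y_i$ to estimate the counterfactual mean
$\mathrm{E}[Y(0) \mid T = 1]$. This gives the Entropy Balancing estimator of
PATT

$$
\hat\gamma^{\mathrm{EB}} = \sum_{T_i = 1} \frac{1}{n_1} Y_i - \sum_{T_i = 0} w_i^{\mathrm{EB}} Y_i. \tag{12}
$$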



Finally, EB is related to the existing literature that estimates the mean
causal effect by an empirical likelihood approach (Wang and Rao, 2002; Tan,
2006; Qin and Zhang, 2007; Tan, 2010). EB is different from these methods in
two ways. First, no explicit propensity score or outcome regression model is
built in the original proposal of EB, while all the other methods rely on at
least one of them. Second, instead of the empirical likelihood objective
$\sum_{i=1}^n \log w_i$, EB uses the Shannon entropy
$\sum_{i=1}^n w_i \log w_i$ as the discrepancy function. This makes EB a convex
optimization problem, and leads to a logistic propensity score model as we have
shown earlier. We can easily generalize EB to other forms of the propensity
score model by changing the discrepancy function.

4. Properties of Entropy Balancing 
Due to its interpretability, Entropy Balancing has already been used in many
applications (e.g. Marcus, 2013; Ferwerda, 2014). However, Hainmueller (2011)
did not investigate its statistical properties. In this section, we study
Entropy Balancing in the framework of Rubin’s causal model and doubly robust
estimators described earlier in Section 2. In particular, we show why Entropy
Balancing should be preferred over the conventional maximum likelihood logistic
regression for the propensity score.

4.1. Existence. We first describe the conditions under which the EB problem (5)
admits a solution. The existence of $w^{\mathrm{EB}}$ depends on the
solvability of the moment matching constraints

$$
\sum_{T_i = 0} w_i c_j(X_i) = \bar c_j(1), \quad j = 1, \dots, p, \qquad w > 0, \qquad \sum_{T_i = 0} w_i = 1. \tag{13}
$$

As one may expect, this is closely related to the existence condition of the
maximum likelihood estimate of logistic regression (Silvapulle, 1981; Albert
and Andersen, 1984). An easy way to obtain such a condition is through the dual
problem of (10):

$$
\begin{aligned}
\underset{w}{\text{maximize}} \quad & -\sum_{i=1}^n \left[ w_i \log w_i + (1 - w_i) \log(1 - w_i) \right] \\
\text{subject to} \quad & \sum_{T_i = 0} w_i c_j(X_i) = \sum_{T_i = 1} w_i c_j(X_i), \quad j = 1, \dots, p, \\
& 0 < w_i < 1, \quad i = 1, \dots, n.
\end{aligned}
\tag{14}
$$
Thus, the existence of $\hat\theta^{\mathrm{MLE}}$ is equivalent to the
solvability of the constraints in (14), which is the overlap condition first
given by Silvapulle (1981).

Intuitively, in the space of $c(X)$, the solvability of (13) or the existence
of $w^{\mathrm{EB}}$ means there is no hyperplane separating
$\{c(X_i)\}_{T_i = 0}$ and $\bar c(1)$. In contrast, the solvability of (14) or
the existence of $\hat\theta^{\mathrm{MLE}}$ means there is no hyperplane
separating $\{c(X_i)\}_{T_i = 0}$ and $\{c(X_i)\}_{T_i = 1}$. Hence the
existence of EB requires a stronger condition than the logistic regression MLE.
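
A rough numerical check of the separating-hyperplane condition for EB is to ask
whether the treated moments can be written as a convex combination of the
control moment vectors. The sketch below poses this as a linear feasibility
problem with `scipy.optimize.linprog`; the function name `moments_reachable` is
hypothetical. Note the simplification: the program allows weights equal to
zero, so it tests membership in the closed convex hull, whereas EB requires
strictly positive weights.

```python
import numpy as np
from scipy.optimize import linprog

def moments_reachable(c_control, c_bar):
    """Return True if c_bar lies in the closed convex hull of the control rows."""
    C = np.asarray(c_control, dtype=float)
    n0 = C.shape[0]
    # Equality constraints: sum_i w_i c(X_i) = c_bar and sum_i w_i = 1.
    A_eq = np.vstack([C.T, np.ones((1, n0))])
    b_eq = np.append(np.asarray(c_bar, dtype=float), 1.0)
    # Zero objective: only feasibility matters; weights are bounded below by 0.
    res = linprog(np.zeros(n0), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.status == 0
```

A `False` result means a separating hyperplane exists, so no balancing weights
can be found for the chosen moment functions.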

The next proposition suggests that the existence of
$\hat\theta^{\mathrm{EB}}$, and hence $w^{\mathrm{EB}}$, is guaranteed by
Assumption 2 (overlap) with high probability.

Proposition 2. Suppose Assumption 2 (overlap) is satisfied and the expectation
of $c(X)$ exists. Then $P(w^{\mathrm{EB}} \text{ exists}) \to 1$ as
$n \to \infty$. Furthermore, $\sum_{T_i = 0} (w_i^{\mathrm{EB}})^2 \to 0$ in
probability as $n \to \infty$.

The proof of this proposition is deferred to the appendix.


4.2. Double Linear-Robustness. We now give the main theorem of the paper,
which demonstrates that although EB does not contain a propensity score model
or an outcome regression model, it nonetheless has a double robustness
property.

Theorem 1. Let Assumption 1 (strong ignorability) and Assumption 2 (overlap)
be given. Additionally, assume the expectation of $c(X)$ exists and
$\mathrm{Var}(Y(0)) < \infty$. Then Entropy Balancing satisfies Properties 1
and 2 in the sense that $\mathrm{logit}(e(X)) = \mathrm{logit}(P(T = 1 \mid X))$
and $g_0(X) = \mathrm{E}[Y(0) \mid X]$ are modeled by linear functions of
$c(X)$. The exact statement is:

(1) If $\mathrm{logit}(e(X))$ or $g_0(X)$ is linear in $c_j(X)$,
$j = 1, \dots, p$, then $\hat\gamma^{\mathrm{EB}}$ is statistically consistent.

(2) Moreover, if $\mathrm{logit}(e(X))$, $g_0(X)$ and $g_1(X)$ are all linear
in $c_j(X)$, $j = 1, \dots, p$, then $\hat\gamma^{\mathrm{EB}}$ reaches the
semiparametric variance bound of $\gamma$ derived by Hahn (1998).


Notice that this statement is similar to the results in Qin and Zhang (2007).
The empirical likelihood (EL) estimator derived therein involves an explicit
propensity score model and balances some auxiliary functions, like the moment
functions $c(X)$ in Entropy Balancing. The main conclusion in Qin and Zhang
(2007) is that the EL estimator has the same properties as in Theorem 1, but
the condition “$\mathrm{logit}(e(X))$ is linear in $c(X)$” is replaced by “the
propensity score model is correctly specified”.

The proof of the first claim in Theorem 1 reveals an elegant correspondence
between the primal-dual optimization and the statistical property of double
robustness, so we sketch the proof here. The proof of the second claim is
deferred to Section 4.4.


Proof sketch. The consistency under the linear model of $Y(0)$ is easily
obtained using the moment balancing constraints in the primal optimization
problem (5). To see this, suppose the true outcome model is

$$
g_0(X) = \mathrm{E}[Y(0) \mid X] = \sum_{j=1}^p \beta_j c_j(X) \tag{15}
$$

for some $\beta_j$ and moment functions $c_j$, $j = 1, \dots, p$. By
Assumption 1 (strong ignorability) we have

$$
\begin{aligned}
\mathrm{E}[Y(0) \mid T = 1] &= \mathrm{E}\left[ \mathrm{E}[Y(0) \mid X] \mid T = 1 \right] \\
&= \mathrm{E}\left[ \sum_{j=1}^p \beta_j c_j(X) \,\Big|\, T = 1 \right] \\
&= \sum_{j=1}^p \beta_j \mathrm{E}[c_j(X) \mid T = 1].
\end{aligned}
\tag{16}
$$

Let $\epsilon_i = Y_i(0) - \sum_{j=1}^p \beta_j c_j(X_i)$, so $\epsilon_i$ is
uncorrelated with $c_j(X_i)$ for $j = 1, \dots, p$.

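The EB estimator, on the other hand, is

$$
\begin{aligned}
\sum_{T_i = 0} w_i Y_i &= \sum_{T_i = 0} w_i Y_i(0) \\
&= \sum_{T_i = 0} w_i \left( \sum_{j=1}^p \beta_j c_j(X_i) + \epsilon_i \right) \\
&= \sum_{j=1}^p \beta_j \sum_{T_i = 0} w_i c_j(X_i) + \sum_{T_i = 0} w_i \epsilon_i \\
&= \sum_{j=1}^p \beta_j \bar c_j(1) + \sum_{T_i = 0} w_i \epsilon_i.
\end{aligned}
\tag{17}
$$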

Since $\bar c(1)$ is unbiased and consistent for
$\mathrm{E}[c(X) \mid T = 1]$, it is obvious that $\hat\gamma^{\mathrm{EB}}$
is always unbiased for $\gamma$. Furthermore, because
$\sum_{T_i = 0} w_i^2 \to 0$ in probability by Proposition 2 and
$\mathrm{Var}(\epsilon_i) < \infty$, the consistency of EB also follows
immediately.

The consistency under the linear model of
$\mathrm{logit}(P(T = 1 \mid X))$ is a consequence of the dual optimization
problem (9). Intuitively, the dual problem (9) can be viewed as fitting a
logistic regression model of the propensity score using a loss function
different from the usual binomial likelihood, as discussed in Section 3. In the
appendix we prove this heuristic rigorously using M-estimation theory.

□ 


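Remark 1. The calculations in (16) and (17) imply that the difference between
the weighted estimate and the target $\mathrm{E}[Y(0) \mid T = 1]$ is

$$
\sum_{T_i = 0} w_i Y_i - \mathrm{E}[Y(0) \mid T = 1] = \sum_{j=1}^p \beta_j \left[ \sum_{T_i = 0} w_i c_j(X_i) - \mathrm{E}[c_j(X) \mid T = 1] \right] + \sum_{T_i = 0} w_i \epsilon_i. \tag{18}
$$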


The decomposition (18) only requires that the true outcome model is (15). It
holds for any weighting estimator, including propensity score matching and
stratification, which are just special ways of weighting. On the right hand
side of (18), the first term is related to the covariate imbalance of the
weighting method, while the second term is purely random noise. Thus, for a
particular realization of the observational data $(X, Y, T)$, the role of
covariate balance is to minimize the bias of the weighting estimator.
Conversely, if the propensity score model does not balance a covariate moment
and the moment is correlated with $Y(0)$, the estimator is biased for this
realization. When we observe many realizations of the same observational study,
this bias will have mean 0 only if the propensity score is correctly specified.


4.3. Entropy Balancing and Outcome Regression. Although the original estimator
$\hat\gamma^{\mathrm{EB}}$ in (12) does not include an outcome regression
model, Theorem 1 shows it still has a double robustness property. In this
section we consider the combination of entropy balancing weights and a
user-specified outcome regression model.

The heuristic that EB implicitly fits a logistic regression model suggests that
it can be used in the doubly robust estimator framework. If we replace the
inverse probability weight $\hat e_i / (1 - \hat e_i)$ in (4) by the EB weight
in (11), a doubly robust estimator of $\gamma$ is

$$
\hat\gamma^{\mathrm{EB\text{-}DR}} = \sum_{T_i = 1} \frac{1}{n_1} \left( Y_i - \hat g_0(X_i) \right) - \sum_{T_i = 0} w_i^{\mathrm{EB}} \left( Y_i - \hat g_0(X_i) \right). \tag{19}
$$


Corollary 1. $\hat\gamma^{\mathrm{EB\text{-}DR}}$ is consistent if
$\mathrm{logit}(e(X))$ is linear in $\{c_j(X)\}$ or $\hat g_0(x)$ is correctly
specified for $g_0(x)$.


An important conclusion of our paper is that Entropy Balancing also fits a
linear outcome model. This is best seen from the following theorem.

Theorem 2. If the fitted outcome regression model is
$\hat g_0(X) = \sum_{j=1}^p \hat\beta_j c_j(X)$, then, whether or not this
model is correctly specified,
$\hat\gamma^{\mathrm{EB\text{-}DR}} = \hat\gamma^{\mathrm{EB}}$.


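Proof. This simple algebraic fact is easily proved by

$$
\begin{aligned}
\hat\gamma^{\mathrm{EB\text{-}DR}} - \hat\gamma^{\mathrm{EB}} &= \sum_{T_i = 0} w_i^{\mathrm{EB}} \sum_{j=1}^p \hat\beta_j c_j(X_i) - \frac{1}{n_1} \sum_{T_i = 1} \sum_{j=1}^p \hat\beta_j c_j(X_i) \\
&= \sum_{j=1}^p \hat\beta_j \left( \sum_{T_i = 0} w_i^{\mathrm{EB}} c_j(X_i) - \frac{1}{n_1} \sum_{T_i = 1} c_j(X_i) \right) \\
&= 0.
\end{aligned}
$$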

□ 
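
The identity in Theorem 2 is purely algebraic and can be checked numerically.
The sketch below does not solve the EB problem; it uses arbitrary positive
weights and constructs synthetic treated features so that exact moment balance
holds by design, which is the only property the proof uses. All variable names
and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1, p = 50, 30, 3

c0 = rng.normal(size=(n0, p))        # c(X_i) for control units
w = rng.random(n0)
w /= w.sum()                         # any positive weights summing to one

c1 = rng.normal(size=(n1, p))        # c(X_i) for treated units
c1 += w @ c0 - c1.mean(axis=0)       # shift so exact balance holds

y0 = rng.normal(size=n0)             # control outcomes
y1 = rng.normal(size=n1)             # treated outcomes
beta = rng.normal(size=p)            # arbitrary, possibly wrong, linear fit

# Plain weighted estimator and its outcome-regression-augmented version.
gamma_eb = y1.mean() - w @ y0
gamma_eb_dr = (y1 - c1 @ beta).mean() - w @ (y0 - c0 @ beta)

gap = gamma_eb_dr - gamma_eb         # zero up to floating-point error
```

Because `beta` is drawn at random, the check illustrates that the equality does
not depend on the linear outcome model being correct.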

This theorem gives a new perspective on the moment constraints in (5).
Typically, applied researchers are advised to check the covariate balance in
order to make sure the propensity score model is not too biased. However,
Theorem 2 indicates that asking for exact covariate balance in the finite
sample actually fits an implicit outcome regression model. As a result, Entropy
Balancing satisfies the doubly robust property stated in Theorem 1.

In principle, we can use this observation to derive a number of doubly robust
estimators. More specifically, we can fit any propensity score model with the
additional exact moment balancing constraints. The resulting PATT estimator is
doubly robust to that propensity score model and the linear outcome regression
model. We prefer Entropy Balancing among estimators of this type because it
corresponds to the most commonly used logistic regression model.
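4.4. Efficiency. In this section, we compute the asymptotic variance of
$\hat\gamma^{\mathrm{EB}}$ using the general M-estimation theory. Let

$$
\begin{aligned}
\phi_j(X, T; m) &= T (c_j(X) - m_j), \quad j = 1, \dots, p, \\
\psi_j(X, T; \theta, m) &= (1 - T)\, e^{\theta^T c(X)} (c_j(X) - m_j), \quad j = 1, \dots, p, \\
\phi_\mu(X, T, Y; \mu(1|1)) &= T (Y - \mu(1|1)), \\
\psi_\gamma(X, T, Y; \theta, \mu(1|1), \gamma) &= (1 - T)\, e^{\theta^T c(X)} (Y + \gamma - \mu(1|1)),
\end{aligned}
$$

and let $\zeta(X, T, Y; m, \theta, \mu(1|1), \gamma) = (\phi^T, \psi^T, \phi_\mu, \psi_\gamma)^T$
be all the estimating equations. The Entropy Balancing estimator
$\hat\gamma^{\mathrm{EB}}$ is the solution to

$$
\frac{1}{n} \sum_{i=1}^n \zeta(X_i, T_i, Y_i; m, \theta, \mu(1|1), \gamma) = 0. \tag{20}
$$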


In the appendix we use this to prove the consistency of
$\hat\gamma^{\mathrm{EB}}$ when the propensity score is logit-linear in
$\{c_j(X)\}_{j=1}^p$. Here we compute the asymptotic variance of
$\hat\gamma^{\mathrm{EB}}$ following the approach described in Stefanski and
Boos (2002, sec. 2). To state our result, we need to introduce three different
kinds of weighted covariance-like functions for arbitrary length p random
vectors $a_1$ and $a_2$:

^^ai.aa = Cov(ai,a2|T = 1), 
e(A) 


n — E 

^ai,a2 — ^ 


(ai - E[ai|r = l])(a2 - E[a2|T = 1])' \T = 1 


Ll-e(A) 

=E[(l-e(A))aiaLT= 1], 

KZa, = E[(l - e(A))ai(a2 - E[a2|r = 1])^|T = 1]. 

It is obvious that $H > K$ and usually $G > H$. To make the notation more
concise, $c(X)$ will be abbreviated as $c$ and $Y(0)$ as $0$ in subscripts. For
example, $H_{c,0} = H_{c(X),Y(0)}$, $G_{c,1} = G_{c(X),Y(1)}$ and
$K_c = K_{c(X),c(X)}$.

Theorem 3. Assume the logistic regression model of $T$ is correct, i.e.
$\mathrm{logit}(P(T = 1 \mid X))$ is a linear combination of
$\{c_j(X)\}_{j=1}^p$. Let $\pi = P(T = 1)$, then we have

N(7, and N(7, where 


;,EB 


(21) P™ = TT-^ ■ {iLi + Go - (2G,,o - i?c.o - + 2i?c.i)} , 

(22) piP'^ = TT-^ ■ {ifi + Go - (7J,.o - 2if- + ‘^Ki)) ■ 

The proof of this theorem is in Appendix B. The $H$, $G$ and $K$ matrices in
Theorem 3 can be estimated from the observed data, yielding approximate
sampling variances for $\hat\gamma^{\mathrm{EB}}$ and
$\hat\gamma^{\mathrm{IPW}}$. Alternatively, variance estimates may be obtained
via the empirical sandwich method (e.g. Stefanski and Boos, 2002). In practice
(particularly in simulations where we compare to a known truth), we find that
the empirical sandwich method is more stable than the plug-in method, which is
consistent with the suggestion in Lunceford and Davidian (2004) for PATE
estimators.

To complete the proof of the second claim in Theorem 1, we compare these
variances with the semiparametric variance bound of $\gamma$ derived by Hahn
(1998):

(23) V* =7r-i • {Hi +Go- 2ff 


iJ[v(o)|x].i5[y(i)|x] 


-G 


E[y(o)|x] 




ElYmx] 


}• 
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Assume logit(P(r = 1|X)) = 0^c(X) and E[y(i)|X] = j3{t)'^c{X), t = 0,1. Then 
H,^t = CoYic{X),Y{t)\T = l) 

= Cov{c{X),/3{tfc{X)\T=l) 

= Hc/3(t), for f = 0 , 1 . 


Similarly, Gc,t = GcPit), t = 0,1. From here it is easy to check F®® and V* are the 
same. Since Entropy Balancing reaches the efficiency bound in this case, obviously 

yEB ^ ^IPW 
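
The check that the EB variance meets the bound can be spelled out as follows. This is only a sketch: it assumes the forms of (21) and (23) in which the bracket of (21) is premultiplied by $H_{c,0}^T H_c^{-1}$ and (23) ends in $+H_{E[Y(0)|X]}$, and it substitutes $H_{c,t} = H_c\beta(t)$ and $G_{c,t} = G_c\beta(t)$.

```latex
\begin{align*}
V^{EB}
 &= \pi^{-1}\Big\{H_1 + G_0 - \beta(0)^T H_c H_c^{-1}
    \big(2G_c\beta(0) - H_c\beta(0) - G_c H_c^{-1} H_c\beta(0) + 2H_c\beta(1)\big)\Big\} \\
 &= \pi^{-1}\Big\{H_1 + G_0 - \beta(0)^T G_c\beta(0) + \beta(0)^T H_c\beta(0)
    - 2\beta(0)^T H_c\beta(1)\Big\} \\
 &= \pi^{-1}\Big\{H_1 + G_0 - G_{E[Y(0)|X]} + H_{E[Y(0)|X]}
    - 2H_{E[Y(0)|X],\,E[Y(1)|X]}\Big\} = V^{*}.
\end{align*}
```

The last step uses $G_{E[Y(0)|X]} = \beta(0)^T G_c \beta(0)$, $H_{E[Y(0)|X]} = \beta(0)^T H_c \beta(0)$ and $H_{E[Y(0)|X],E[Y(1)|X]} = \beta(0)^T H_c \beta(1)$, which follow from the bilinearity of H and G.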


If logit(P(T = 1|X)) = $\theta^T c(X)$ but $E[Y(t)|X] = \beta(t)^T c(X)$ is not true for some t = 0, 1, there is no absolute guarantee that $V^{EB} \le V^{IPW}$. In practice, the features $c(X)$ in the linear models of Y are almost always correlated with Y. This correlation compensates for the slight efficiency loss of not maximizing the likelihood function in logistic regression. As a consequence, the variance $V^{EB}$ in (21) is usually smaller than $V^{IPW}$ in (22). This efficiency advantage of EB over IPW is verified in the next section using simulations.


5. Simulations 


In this section, we use simulated examples to verify the theorems and claims we made earlier. We will also compare EB weighting with IPW (after logistic regression) and the Covariate Balancing Propensity Score (CBPS) proposed by Imai and Ratkovic (2014). See Section 6.2 for more discussion of CBPS.


5.1. Kang-Schafer Example. First we consider the simulation model in Kang and Schafer (2007), which demonstrates that “in at least some settings, two wrong models are not better than one” for the doubly robust estimator. The simulated data consist of $\{(X_i, Z_i, T_i, Y_i), i = 1,\dots,n\}$. $X_i$ and $T_i$ are always observed, $Y_i$ is observed only if $T_i = 1$, and $Z_i$ is never observed. To generate this data set, $X_i$ is distributed as $N(0, I_4)$, and $Z_i$ is computed by first applying the following transformation:


\[ Z_{i1} = \exp(X_{i1}/2), \]
\[ Z_{i2} = X_{i2}/(1 + \exp(X_{i1})) + 10, \]
\[ Z_{i3} = (X_{i1}X_{i3}/25 + 0.6)^3, \]
\[ Z_{i4} = (X_{i2} + X_{i4} + 20)^2. \]

Next we normalize each column such that Zi has mean 0 and standard deviation 1. 

In one setting, $Y_i$ is generated by $Y_i = 210 + 27.4X_{i1} + 13.7X_{i2} + 13.7X_{i3} + 13.7X_{i4} + \epsilon_i$, $\epsilon_i \sim N(0,1)$, and the true propensity scores are $e_i = \mathrm{expit}(-X_{i1} + 0.5X_{i2} - 0.25X_{i3} - 0.1X_{i4})$. In this case, both Y and T can be correctly modeled by a (generalized) linear model of the observed covariates X.

In the other settings, at least one of the propensity score model and the outcome 
regression model is incorrect. In order to achieve this, the data generating process 
described above is altered such that Y or T (or both) is linear in the unobserved Z 
instead of the observed X, though the parameters are kept the same. 
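
The data generating process described above can be sketched in a few lines of NumPy. This is only an illustration of the "both models correct" setting, not the authors' simulation code: the function name `kang_schafer` and the seed handling are hypothetical, and the `/25` scaling inside the third transformation is taken from the original Kang and Schafer (2007) design.

```python
import numpy as np

def kang_schafer(n=1000, seed=0):
    """Draw one data set from the setting where Y and T are linear in X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 4))                 # X_i ~ N(0, I_4)

    # Nonlinear transformation of X (assumed /25 scaling in the third column).
    Z = np.column_stack([
        np.exp(X[:, 0] / 2),
        X[:, 1] / (1 + np.exp(X[:, 0])) + 10,
        (X[:, 0] * X[:, 2] / 25 + 0.6) ** 3,
        (X[:, 1] + X[:, 3] + 20) ** 2,
    ])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)        # mean 0, sd 1 per column

    # Outcome and propensity score, both (generalized) linear in X.
    Y = 210 + 27.4 * X[:, 0] + 13.7 * X[:, 1:].sum(axis=1) + rng.standard_normal(n)
    lin = -X[:, 0] + 0.5 * X[:, 1] - 0.25 * X[:, 2] - 0.1 * X[:, 3]
    e = 1 / (1 + np.exp(-lin))                      # expit
    T = rng.binomial(1, e)
    return X, Z, T, Y, e
```

In an analysis, only `Y[T == 1]` would be treated as observed; the misspecified settings replace X by Z in the formula for Y, for the propensity score, or for both.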

For each setting (4 in total), we generated 1000 simulated data sets of size n = 1000 and applied various methods discussed earlier, including

(1) IPW, CBPS: the IPW estimator with propensity score estimated by logistic regression or CBPS;

(2) EB: the Entropy Balancing estimator in (12);

(3) IPW+DR, CBPS+DR: the doubly robust estimator with propensity score estimated by logistic regression or CBPS.

Notice that the correct mean of Y is always 210. The simulation results are shown in Figure 2. The numbers printed at the top of the figure are standard deviations of each method.

The reader may have noticed some unusual facts in this plot. First, the doubly robust estimator “IPW+DR” performs poorly when both models are misspecified (bottom-right panel in Figure 2). In fact, all three doubly robust methods are worse than just using IPW. Second, the three doubly robust estimators have exactly the same variance if the Y model is correct (top two panels in Figure 2). It seems that how one fits the propensity score model has no impact on the final estimate. This is related to the observation in Kang and Schafer (2007) that, in this example, the plain OLS estimate of Y actually outperforms any method involving the propensity score model. Discussion articles such as Robins et al. (2007) and Ridgeway and McCaffrey (2007) find this phenomenon very uncommon in practice and most likely due to the estimated inverse probability weights being highly variable, which is a bad setting for doubly robust estimators.

Although the Kang-Schafer example is artificial and is arguably not very likely 
to occur in practice, we make two comments here about entropy balancing in this 
unfavorable setting for doubly robust estimators: 

(1) If both T and Y models are misspecified, EB has smaller bias than the conventional “IPW+DR” or “CBPS+DR”. So EB seems to be less affected by such an unfavorable setting.

(2) When the T model is correct but the Y model is wrong (bottom-left panel in Figure 2), EB has the smallest variance among all estimators. This supports the conclusion of our efficiency comparison of IPW and EB in Section 4.4.


Finally, we note that the same simulation setting is used in Tan (2010) to study the performance of a number of doubly robust estimators. The reader can compare Figure 2 with the results there. Overall, the performance of Entropy Balancing is comparable to the best estimator in Tan (2010).


5.2. Lunceford-Davidian Example. Here we consider another comprehensive simulation example, which is used by Lunceford and Davidian (2004) to study estimators of PATE $\tau$. We find this example is suitable to investigate estimators of PATT $\gamma$ as well.

[Figure 2: box plots of the estimates by Method (IPW, CBPS, EB, IPW+DR, CBPS+DR) in a 2 x 2 grid of panels. Columns: PS.model correct / incorrect. Rows: OR.model correct / incorrect. Standard deviations printed in the panels, in plotting order: top-left 1.659, 1.538, 1.139, 1.139, 1.139; top-right 1.601, 1.345, 1.148, 1.148, 1.148; bottom-left 1.105, 0.9995, 0.8206, 1.589, 1.337; bottom-right 1.198, 1.12, 0.8614, 1.576, 1.051.]

Figure 2. Results of the Kang-Schafer Example. The methods are: Inverse Propensity Weighting (IPW), Covariate Balancing Propensity Score (CBPS), Entropy Balancing (EB), and doubly robust versions of the first two (IPW+DR, CBPS+DR). Both the propensity score model and the outcome regression model can be correct or incorrect, so there are four scenarios in total. We generate 1000 simulations of sample size 1000 from the example in Kang and Schafer (2007) and apply five different estimators. The target mean is 210 and is marked as a black horizontal line to compare the biases of the methods. Numbers printed at Y = 230 are the sample standard deviation of each method, in order to compare their efficiency.

In this simulation, the data still consist of $\{(X_i, Z_i, T_i, Y_i), i = 1,\dots,n\}$, but all of them are observed. Both $X_i$ and $Z_i$ are three dimensional vectors. The propensity score is only related to X through:

\[ \mathrm{logit}(P(T_i = 1)) = \beta_0 + \sum_{j=1}^{3}\beta_j X_{ij}. \]

Note the above does not involve elements of $Z_i$. The response Y is generated according to

\[ Y_i = \nu_0 + \sum_{j=1}^{3}\nu_j X_{ij} + \nu_4 T_i + \sum_{j=1}^{3}\xi_j Z_{ij} + \epsilon_i, \quad \epsilon_i \sim N(0,1). \]

The parameters here are set to be

\[ \nu = (0,-1,1,-1,2)^T; \]
$\beta$ is set as:

\[ \beta^{no} = (0,0,0,0)^T, \quad \beta^{moderate} = (0,0.3,-0.3,0.3)^T, \quad\text{or}\quad \beta^{strong} = (0,0.6,-0.6,0.6)^T. \]

The choice of $\beta$ depends on the level of association of T and X. $\xi$ is based on a similar choice on the level of association of Y and Z:

\[ \xi^{no} = (0,0,0)^T, \quad \xi^{moderate} = (-0.5,0.5,0.5)^T, \quad\text{or}\quad \xi^{strong} = (-1,1,1)^T. \]

The joint distribution of $(X_i, Z_i)$ is specified by taking $X_{i3} \sim \mathrm{Bernoulli}(0.2)$ and then generating $Z_{i3}$ as Bernoulli with

\[ P(Z_{i3} = 1 \mid X_{i3}) = 0.75X_{i3} + 0.25(1 - X_{i3}). \]

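Conditional on $X_{i3}$, $(X_{i1}, Z_{i1}, X_{i2}, Z_{i2})$ is then generated as multivariate normal $N(\alpha_{X_{i3}}, \Sigma_{X_{i3}})$, where $\alpha_1 = (1,1,-1,-1)^T$, $\alpha_0 = (-1,-1,1,1)^T$ and

\[ \Sigma_0 = \Sigma_1 = \begin{pmatrix} 1 & 0.5 & -0.5 & -0.5 \\ 0.5 & 1 & -0.5 & -0.5 \\ -0.5 & -0.5 & 1 & 0.5 \\ -0.5 & -0.5 & 0.5 & 1 \end{pmatrix}. \]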


The data generating process implies the true PATT is $\gamma = 2$. Since the outcome Y depends on both X and Z, we always fit a full linear model of Y using X and Z, if such a model is needed. T only depends on X, so it is not necessary to include Z in propensity score modeling. However, as pointed out by Lunceford and Davidian (2004, Sec. 3.3), it is actually beneficial to “overmodel” the propensity score by including Z in the model. Here we will try both possibilities, the “full” modeling of T using both X and Z, and the “partial” modeling of T using only X.

We generated 1000 simulated data sets and the results are shown in Figure 3 for “full” propensity score modeling and Figure 4 for “partial” propensity score modeling. We make the following comments about these two plots:

(1) IPW and all the other estimators are always consistent, no matter what level of association is specified. This is because the propensity score model is always correctly specified.

(2) When using full propensity score modeling, all the doubly robust estimators (EB, IPW+DR, CBPS+DR and EB+DR) have almost the same sample variance. This is because all of them are asymptotically efficient.

(3) CBPS, to our surprise, does not perform very well in this simulation. It has smaller variance than IPW but this comes at the price of some bias. If we use the partial propensity score model (only involving X, Figure 4), this bias is a little smaller but still not negligible. While it is not clear what causes this bias, one possible reason is that the optimization problem of CBPS is nonconvex, so the local solution which is used to construct the estimator of $\gamma$ could be far from the global solution. Another possibility is that CBPS uses GMM or Empirical Likelihood to combine the likelihood with an imbalance penalty, which is less efficient than maximizing the likelihood directly. Thus, although the estimator is asymptotically unbiased, its convergence to the true $\gamma$ is considerably slower than that of IPW. CBPS combined with outcome regression (CBPS+DR) fixes the bias and inefficiency issues that occur in CBPS without outcome regression.

[Figure 3: box plots of the estimates by Method, in a grid of panels indexed by OR.assoc.Z (no, moderate, strong) and PS.assoc.X.]

Figure 3. Results of the Lunceford-Davidian example (full propensity score modeling). The propensity score model and outcome regression model, if applicable, are always correctly specified, but the level of association between T or Y with X or Z could be different, resulting in 9 different scenarios. X are confounding covariates and Z only affects the outcome. We generate 1000 simulations of sample size 1000 in each scenario and apply five different estimators. Target PATE is 2 and is marked as a black horizontal line to compare the biases of the methods. Numbers printed at Y = 5 are the sample standard deviation of each method, in order to compare their efficiency.

[Figure 4: box plots of the estimates by Method, same panel layout as Figure 3.]

Figure 4. Results of the Lunceford-Davidian example (partial propensity score modeling). The settings are exactly the same as Figure 3 except the methods here don’t use Z in their propensity score models.

(4) EB, in contrast, performs quite well in this simulation. It has relatively small variance overall, especially if we use the “full” model, i.e. both X and Z are being balanced.

(5) The difference between EB and EB+DR is that while EB only balances “partial” or “full” covariates, EB+DR additionally combines an outcome linear regression model on all the covariates. As proved in the earlier theorem, when the “full” covariates are used, EB is exactly the same as EB+DR. We can observe this from Figure 3. When EB only balances “partial” covariates, the two methods are different and indeed EB+DR is more efficient in Figure 4 since it fits the correct Y model.

(6) Using the “full” propensity score model improves the efficiency of pure weighting estimators (IPW, CBPS and EB) a lot, but has very little impact on estimators that involve an outcome regression model (IPW+DR and CBPS+DR) compared to “partial” propensity score modeling. Although EB could be viewed as fitting an outcome model implicitly, the “partial” EB estimator only uses X in the outcome model, which is precisely the reason why it is not efficient. Thus there are both robustness and efficiency reasons that one should include all relevant covariates in EB, even if the covariates affect only one of T and Y.


6. Extensions 

6.1. Weighted regression. The weights generated by EB can be used in a weighted 
outcome regression model 

(24) ^EB-WLS ^ (Y, , 

\Ti=0 / \Ti=0 / 


(25) 


.:j,EB-WLS 


ni 


Ti = l 


^EB-WLS 


The weighted regression estimator ^eb-wls same double robustness prop¬ 
erty as 7®^ because it fits a linear model on Y. In general, unlike 3^3 

Theorem ^eb-wls jg 33 |;]^g original estimator 7®®. 


6.2. Relaxed balancing. We next propose a relaxation of Entropy Balancing and compare it to a recently proposed robust method of fitting propensity scores (Imai and Ratkovic, 2014).

One concern about Entropy Balancing is that the exact moment balancing constraints could be too stringent, especially if we have a lot of them. This could happen in high dimensional problems or if we wish to balance many interactions. In such a case, the solution to the EB problem might not exist. Even if it does, the weights could have very high variance. Thus, we propose an extension to the EB optimization problem with relaxed constraints:


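(26) \[ \underset{w}{\text{minimize}} \quad \sum_{T_i=0} w_i\log w_i + \frac{1}{\lambda}\sum_{j=1}^{R}\Big[\sum_{T_i=0} w_i c_j(X_i) - \bar c_j(1)\Big]^2 \]
\[ \text{subject to} \quad \sum_{T_i=0} w_i = 1, \quad w_i > 0,\ i = 1,\dots,n. \]

It is obvious that if $\lambda = 0$, (26) is equivalent to the original EB problem. The dual problem of (26) is

(27) \[ \underset{\theta}{\text{minimize}} \quad \log\Big(\sum_{T_i=0}\exp\Big(\sum_{j=1}^{R}\theta_j\big(c_j(X_i) - \bar c_j(1)\big)\Big)\Big) + \frac{\lambda}{4}\|\theta\|_2^2. \]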


This means the “softened” EB essentially fits a logistic model by minimizing an $\ell_2$-regularized exponential loss on half of the data. Regularization is known to improve estimation and prediction accuracy by reducing variance. However, this reduction comes at the price of adding bias to the statistical inference. Since $\lambda = 0$ in (26) corresponds to the original EB problem, EB employs no regularization in (27). Therefore, it gives an unbiased estimator of the mean causal effect, as we have already seen from the earlier theorem.

Recently, Imai and Ratkovic (2014) proposed the Covariate Balancing Propensity Score (CBPS), which is related to this relaxation of EB. CBPS explicitly builds a logistic propensity score model and tries to make it robust by combining covariate balancing constraints. However, this robust approach is different from the dual optimization (27) in the following ways:

(1) CBPS replaces the “half” exponential loss in (27) with the full logistic likelihood loss.

(2) Instead of the norm of $\theta$, CBPS penalizes by the covariate imbalance induced by the fitted propensity score model, similar to the soft balancing term in the EB primal problem (26).

(3) CBPS combines the likelihood loss in point (1) and the covariate imbalance 
in point (2) using Generalized Method of Moments or Empirical Likelihood, 
two common approaches for solving general estimating equations. 

Our simulations in Section 5 show that EB usually performs better than CBPS. We make
two additional notes about CBPS here: first, since it is purely a robust way to 
fit the propensity score model and does not involve any modeling of Y, it can 
be combined with an outcome regression model to achieve double robustness (see 
Section 2.2); second, unlike Entropy Balancing, the optimization problem CBPS 
proposes is non-convex, has no guarantee of finding the global optimum, and is 
much slower to solve in practice. 
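
To make the convexity of the EB dual concrete, the following is a minimal numerical sketch. It assumes the dual objective has the form log-sum-exp of the centred moments plus a ridge term `(lam / 4) * ||theta||^2`, with `lam = 0` giving exact balancing. The function name `entropy_balance`, the damped Newton solver and all argument names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def entropy_balance(C0, target, lam=0.0, max_iter=100, tol=1e-10):
    """Control-group weights from the (optionally ridge-relaxed) EB dual.

    C0     : (n0, R) array of moments c_j(X_i) for control units
    target : length-R array of treated-group means c_bar_j(1)
    lam    : assumed relaxation parameter; 0 means exact balancing
    """
    D = np.asarray(C0, dtype=float) - np.asarray(target, dtype=float)
    theta = np.zeros(D.shape[1])

    def weights(th):
        s = D @ th
        w = np.exp(s - s.max())            # shift for numerical stability
        return w / w.sum()

    def objective(th):
        s = D @ th
        return s.max() + np.log(np.exp(s - s.max()).sum()) + 0.25 * lam * (th @ th)

    for _ in range(max_iter):
        w = weights(theta)
        mean = D.T @ w                     # weighted imbalance
        grad = mean + 0.5 * lam * theta
        if np.abs(grad).max() < tol:
            break
        hess = (D * w[:, None]).T @ D - np.outer(mean, mean)
        hess += 0.5 * lam * np.eye(theta.size)
        step = np.linalg.solve(hess, grad)
        # Backtracking line search (Armijo rule) on the convex objective.
        t, f0, slope = 1.0, objective(theta), grad @ step
        while t > 1e-10 and objective(theta - t * step) > f0 - 1e-4 * t * slope:
            t *= 0.5
        theta = theta - t * step
    return weights(theta), theta
```

With `lam=0.0` the returned weights reproduce the treated-group moments up to solver tolerance, provided the target lies inside the convex hull of the control moments; a positive `lam` shrinks `theta` and leaves some residual imbalance.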


7. Concluding Remarks 

7.1. The role of covariate balance. As a concluding remark, we argue that the 
two arms of double robustness are not completely independent. Instead, they are 
collaborating or even competing towards the same goal—covariate balance. This 
connection gives a new view on using covariate balance as a diagnostic for propensity 
scoring. 

First, suppose we are given a propensity score model that already balances some covariate moments. Our earlier theorem indicates that “double-robustifying” the corresponding IPW estimator by an outcome linear regression model on those moments does not change the estimator at all. In other words, this outcome regression model is automatically built in when we ask for covariate balance in propensity scores.

Second, a linear regression model of the outcome also balances the covariates. Consider a linear regression of the potential outcome Y(0) on the covariate moments $c_j(X)$, $j = 1,2,\dots,p$. Let $X^t$ denote the matrix with entries $X^t_{ij} = c_j(X_i)$ for i in the group t = 0 or 1 and $Y^t$ denote the vector of outcomes in group t = 0 or 1. An OLS estimate of the coefficients is

\[ \hat\beta = [(X^0)^T X^0]^{-1}(X^0)^T Y^0. \]

The OLS estimator of E[Y(0)|T = 1] is simply

\[ \frac{1}{n_1}\mathbf{1}^T(X^1\hat\beta) = \frac{1}{n_1}\mathbf{1}^T\left\{X^1[(X^0)^T X^0]^{-1}(X^0)^T\right\}Y^0, \]

which can be thought of as a weighted sum of $Y^0$. These weights, when applied to the covariates of group 0, automatically match the unweighted average of group 1. That is, if we let $w_i$ be the i-th element of $\frac{1}{n_1}\mathbf{1}^T X^1[(X^0)^T X^0]^{-1}(X^0)^T$ for i in group 0, then

\[ \sum_{T_i=0} w_i c_j(X_i) = \frac{1}{n_1}\sum_{T_i=1} c_j(X_i), \]

which is exactly the same as the moment balancing constraints of Entropy Balancing. The derivation here implies, surprisingly, that the balancing weights can also be obtained by outcome regression.
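
The balancing property of the regression-implied weights is easy to verify numerically. The snippet below is a small sketch on synthetic data; the sample sizes, the random covariates and the variable names are all made up for illustration, and an intercept column is included among the moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, p = 50, 30, 3

# Moment matrices for controls (X0) and treated (X1), each with an intercept.
X0 = np.column_stack([np.ones(n0), rng.standard_normal((n0, p))])
X1 = np.column_stack([np.ones(n1), rng.standard_normal((n1, p)) + 0.5])
Y0 = rng.standard_normal(n0)

# Implied weights: w^T = (1/n1) 1^T X1 [(X0)^T X0]^{-1} (X0)^T.
treated_mean = X1.mean(axis=0)
w = X0 @ np.linalg.solve(X0.T @ X0, treated_mean)

# OLS prediction of the mean control outcome for the treated group.
beta_hat = np.linalg.lstsq(X0, Y0, rcond=None)[0]
ols_estimate = treated_mean @ beta_hat

print(np.allclose(X0.T @ w, treated_mean))   # weighted controls match treated means
print(np.isclose(w @ Y0, ols_estimate))      # weighted sum of Y0 equals the OLS estimate
```

Unlike the Entropy Balancing weights, nothing forces these regression weights to be nonnegative, so some entries of `w` can be negative.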

For the two reasons above, both propensity scoring and outcome regression are trying to improve the covariate balance, as illustrated in Figure 1. Further, a logistic regression model of propensity scores and a linear regression model of outcome exhibit a certain degree of equivalence.


7.2. Conclusions. Entropy Balancing is motivated by and directly addresses the “propensity score tautology” of repeatedly checking covariate balance. When a researcher decides to check the covariate balance after modeling propensity scores, he or she is implicitly checking for robustness with respect to certain outcome models (the linear models on those covariate moments he or she checks). The researcher is also forfeiting robustness to all other outcome models (see the earlier remark), unless he or she is truly confident about the propensity score model (for example, only 5 degrees of freedom are used in the propensity score model and 100 target covariate moments are balanced). By using the principle of relative entropy minimization, Entropy Balancing is the “minimal” model for this purpose and should be preferred for its simplicity as well as the strong statistical properties shown in this paper.


Appendix A. Proof of Proposition 2

Since the expectation of $c(X)$ exists, the weak law of large numbers says $\bar c(1) \overset{p}{\to} c^*(1) = E[c(X)|T = 1]$. Therefore

Lemma 1. For any $\epsilon > 0$, $P(\|\bar c(1) - c^*(1)\|_\infty > \epsilon) \to 0$ as $n \to \infty$.

Now condition on $\|\bar c(1) - c^*(1)\|_\infty \le \epsilon$, i.e. $\bar c(1)$ is in the box of side length $2\epsilon$ centered at $c^*(1)$. We want to prove that with probability going to 1 there exists $w$ such that $w_i > 0$, $\sum_{T_i=0} w_i = 1$ and $\sum_{T_i=0} w_i c(X_i) = \bar c(1)$. Equivalently, this is saying the convex hull generated by $\{c(X_i)\}_{T_i=0}$ contains $\bar c(1)$. We will indeed prove a slightly stronger result:


Lemma 2. With probability going to 1 the convex hull generated by $\{c(X_i)\}_{T_i=0}$ contains the box $B_\epsilon(c^*(1)) = \{c(x) : \|c(x) - c^*(1)\|_\infty \le \epsilon\}$ for some $\epsilon > 0$.


Proposition follow immediately from Lemma and Lemm a [2| Now w e prove 
LemmaDenote the sample space of A by D)A). Assumption 2 (overlap) implies 
c*(l) hence Be(c*(l)) is in the interior of the convex hull of il{X) for sufficiently 
small e. Let Ri, i = 1,..., 3^, be the 3^ boxes centered at c* (1) -|- |e6, where b gW 
is a vector that each entry can be —1, 0, or 1. It is easy to check that the sets Ri are 
disjoint and the convex hull of $\{x_i\}_{i=1}^{3^p}$ contains $B_\epsilon(c^*(1))$ if $x_i \in R_i$, $i = 1,\dots,3^p$.
Since $0 < P(T = 0|X) < 1$, $\rho = \min_i P(X \in R_i|T = 0) > 0$. This implies

(28) \[ P(\exists\, X_i \in R_j \text{ and } T_i = 0,\ \forall j = 1,\dots,3^p) \ \ge\ 1 - \sum_{j=1}^{3^p} P(X \notin R_j|T = 0)^n \ \ge\ 1 - 3^p(1-\rho)^n \ \to\ 1 \]

as $n \to \infty$. This proves the lemma because the event in the left hand side implies
the convex hull generated by {c{Xi)}Ti=o contains the desired box. Note that (28) 
also tells us how many samples we actually need to ensure the existence of 
Indeed if n > p“^(plog2 + logi5“^) > log(2_p)(^2“^’), then the probability in (28) 
is greater than 1 — <5. Usually we expect <5 = 0(3“^). If this is the case, the number 
of samples needed is n = 0(p • 


Now we turn to the second claim of the proposition, i.e. X)r =o prove 

this, we only need to find a sequence (with respect to growing n) of feasible solutions 
to (^ such that max^ Wi —> 0. This is not hard to show, because the probability in 
(28) is exponentially decaying as n increases. We can pick m > N{d,p,p) such that 
the probability of the convex hull of contains B^{c*{l)) is at least 1 — 5, then 

pick m+i > rii + 3^N{d,p,p) so the convex hull of {xi}"l+’._|_^ contains i?e(c*(l)) 
with probability at least 1 — 3*5. This means for each i = 0,1,..., 

we have a set of weights such that +i ^iXi = c(l). Now suppose 

nk ^ n < rifc+i, the choice Wi = Wi/k if * < and Wi = 0 if i > rifc satisfies 
the constraints and max^ Wi < k. As n —^ oo, this implies max^ —>■ 0 and hence 
0 with probability tending to 1. 


Appendix B. Proof of Theorem 1 and 3

The first claim of Theorem 1 is immediately proved by the following lemma:

Lemma 3. Under the assumptions in Theorem 1 and suppose logit(P[T = 1|X]) = $\sum_{j=1}^{R}\theta_j^* c_j(X)$, then as $n \to \infty$, $\hat\theta^{EB} \overset{p}{\to} \theta^*$. As a consequence,

\[ \sum_{T_i=0} w_i^{EB} Y_i \overset{p}{\to} E[Y(0)|T = 1]. \]


Proof. The proof is a standard application of M-estimation (more precisely Z-estimation) theory, applied to

(29) \[ \sum_{i=1}^{n}(1 - T_i)\,e^{\sum_{k=1}^{R}\theta_k c_k(X_i)}\,\big(c_j(X_i) - \bar c_j(1)\big) = 0, \quad j = 1,\dots,R. \]

We can rewrite (29) as estimating equations. Let $\phi_j(X,T;m) = T(c_j(X) - m_j)$, $j = 1,\dots,R$ and $\psi_j(X,T;\theta,m) = (1 - T)\exp\{\sum_{k=1}^{R}\theta_k c_k(X)\}(c_j(X) - m_j)$, then (29)

^Note that this naive rate can actually be greatly improved by Wendel’s theorem in geometric 
probability theory.

is equivalent to

(30) \[ \frac{1}{n}\sum_{i=1}^{n}\phi_j(X_i,T_i;m) = 0,\ j = 1,\dots,R, \qquad \frac{1}{n}\sum_{i=1}^{n}\psi_j(X_i,T_i;\theta,m) = 0,\ j = 1,\dots,R. \]

Since $\phi(\cdot)$ and $\psi(\cdot)$ are all smooth functions of $\theta$ and $m$, all we need to verify is that $m_j^* = E[c_j(X)|T = 1]$ and $\theta^*$ is the unique solution to the population version of (30). It is obvious that $m^*$ is the solution to $E[\phi_j(X,T;m)] = 0$, $j = 1,\dots,R$.
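Now take conditional expectation of $\psi_j$ given X:

\[ E[\psi_j(X,T;\theta,m^*)|X] = (1 - e(X))\,e^{\sum_{k=1}^{R}\theta_k c_k(X)}\,(c_j(X) - m_j^*) \]
\[ = \left(1 - \frac{e^{\sum_{k=1}^{R}\theta_k^* c_k(X)}}{1 + e^{\sum_{k=1}^{R}\theta_k^* c_k(X)}}\right) e^{\sum_{k=1}^{R}\theta_k c_k(X)}\,(c_j(X) - m_j^*) \]
\[ = \frac{e^{\sum_{k=1}^{R}\theta_k c_k(X)}}{1 + e^{\sum_{k=1}^{R}\theta_k^* c_k(X)}}\,\big(c_j(X) - E[c_j(X)|T = 1]\big). \]

The only way to make $E[\psi_j(X,T;\theta,m^*)] = 0$ is to have

\[ \frac{e^{\sum_{k=1}^{R}\theta_k c_k(X)}}{1 + e^{\sum_{k=1}^{R}\theta_k^* c_k(X)}} = \text{const}\cdot P(T = 1|X), \]

i.e. $\theta = \theta^*$. This proves the consistency of $\hat\theta^{EB}$.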
The consistency of 7 ^® is proved by noticing 




exp(Ei=i^rcj(^*)) p P{T, = l\Xi) 


Et,=o eMEU 1 - m = i | x ,) ’ 

which is the IPW-NR weight defined in ([^. 


□ 


The second claim is a corollary of Theorem 3, which is proved below. For simplicity we denote $\xi = (m^T, \theta^T, \mu(1|1), \gamma)^T$ and the true parameter as $\xi^*$. Throughout this section we assume logit(e(X)) = $\sum_{j=1}^{R}\theta_j^* c_j(X)$. Denote $\tilde c(X) = c(X) - c^*(1)$,

\[ e^*(X) = e(X;\theta^*), \quad l^*(X) = \exp\Big\{\sum_{j=1}^{R}\theta_j^* c_j(X)\Big\} = e^*(X)/(1 - e^*(X)). \]

There are two forms of “information” matrix that need to be computed. The first is

\[ A^{EB}(\xi^*) = E\left[-\frac{\partial}{\partial\xi^T}\zeta(X,T,Y;\xi^*)\right] \]
\[ = E\begin{pmatrix}
T\cdot I_R & 0 & 0 & 0 \\
(1-T)\,l^*(X)\cdot I_R & -(1-T)\,l^*(X)(c(X) - c^*(1))c(X)^T & 0 & 0 \\
0^T & 0^T & T & 0 \\
0^T & -(1-T)\,l^*(X)(Y(0) - \mu^*(0|1))c(X)^T & (1-T)\,l^*(X) & -(1-T)\,l^*(X)
\end{pmatrix} \]
\[ = \pi\begin{pmatrix}
I_R & 0 & 0 & 0 \\
I_R & -\mathrm{Cov}[c(X)|T = 1] & 0 & 0 \\
0^T & 0^T & 1 & 0 \\
0^T & -\mathrm{Cov}(Y(0), c(X)|T = 1) & 1 & -1
\end{pmatrix}. \]

A very useful identity in the computation of the expectation is

\[ E[f(X,Y)|T = 1] = \pi^{-1}E[e(X)f(X,Y)] = \pi^{-1}P(T = 0)\,E\left[\frac{e(X)}{1 - e(X)}\,f(X,Y)\,\Big|\,T = 0\right]. \]


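The second information matrix is the covariance of $\zeta(X,T,Y;\xi^*)$. Denote $\tilde Y(t) = Y(t) - \mu^*(t|1)$, t = 0, 1,

\[ B^{EB}(\xi^*) = E\begin{pmatrix}
T\tilde c(X)\tilde c(X)^T & 0 & T\tilde Y(1)\tilde c(X) & 0 \\
0 & (1-T)\,l^*(X)^2\,\tilde c(X)\tilde c(X)^T & 0 & (1-T)\,l^*(X)^2\,\tilde Y(0)\tilde c(X) \\
T\tilde Y(1)\tilde c(X)^T & 0^T & T\tilde Y^2(1) & 0 \\
0^T & (1-T)\,l^*(X)^2\,\tilde Y(0)\tilde c(X)^T & 0 & (1-T)\,l^*(X)^2\,\tilde Y^2(0)
\end{pmatrix}. \]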

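The asymptotic distribution of $\hat\gamma^{EB}$ is $N(\gamma, V^{EB}(\xi^*)/n)$ where $V^{EB}(\xi^*)$ is the bottom right entry of $A^{EB}(\xi^*)^{-1}B^{EB}(\xi^*)A^{EB}(\xi^*)^{-T}$. Let's denote

\[ H_{a_1,a_2} = \mathrm{Cov}(a_1,a_2|T = 1), \]
\[ G_{a_1,a_2} = E\left[l^*(X)(a_1 - E[a_1|T = 1])(a_2 - E[a_2|T = 1])^T \mid T = 1\right], \]

and $H_a = H_{a,a}$, $G_a = G_{a,a}$. So

\[ A^{EB}(\xi^*)^{-1} = \pi^{-1}\begin{pmatrix}
I_R & 0 & 0 & 0 \\
I_R & -H_{c(X)} & 0 & 0 \\
0^T & 0^T & 1 & 0 \\
0^T & -H_{Y(0),c(X)} & 1 & -1
\end{pmatrix}^{-1}
= \pi^{-1}\begin{pmatrix}
I_R & 0 & 0 & 0 \\
H_{c(X)}^{-1} & -H_{c(X)}^{-1} & 0 & 0 \\
0^T & 0^T & 1 & 0 \\
-H_{Y(0),c(X)}H_{c(X)}^{-1} & H_{Y(0),c(X)}H_{c(X)}^{-1} & 1 & -1
\end{pmatrix} \]

and

\[ B^{EB}(\xi^*) = \pi\begin{pmatrix}
H_{c(X)} & 0 & H_{c(X),Y(1)} & 0 \\
0 & G_{c(X)} & 0 & G_{c(X),Y(0)} \\
H_{Y(1),c(X)} & 0^T & H_{Y(1)} & 0 \\
0^T & G_{Y(0),c(X)} & 0 & G_{Y(0)}
\end{pmatrix}. \]

Thus

\[ V^{EB}(\xi^*) = \pi^{-1}\cdot\left\{H_{c,0}^T H_c^{-1}\left(H_{c,0} + G_c H_c^{-1}H_{c,0} - 2G_{c,0} - 2H_{c,1}\right) + H_1 + G_0\right\}. \]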

It would be interesting to compare $V^{EB}(\xi^*)$ with the asymptotic variance of $\hat\gamma^{IPW}$. The IPW PATT estimator is equivalent to solving the following estimating equations

\[ \frac{1}{n}\sum_{i=1}^{n}\left(T_i - \frac{1}{1 + e^{-\sum_{k=1}^{R}\theta_k c_k(X_i)}}\right)c_j(X_i) = 0, \quad j = 1,\dots,R, \]
\[ \frac{1}{n}\sum_{i=1}^{n}\phi_{1|1}(X_i,T_i,Y_i;\mu(1|1)) = 0, \]
\[ \frac{1}{n}\sum_{i=1}^{n}\psi_{0|1}(X_i,T_i,Y_i;\theta,\mu(1|1),\gamma) = 0. \]

If we call

\[ K_{a_1,a_2} = E[(1 - e(X))\,a_1 a_2^T \mid T = 1], \]

we have



\ /e*iX)(l-e*iX))ciX)c{X)T 

0 

° M 

= E 


T 

0 


[V -{1-T)IAX)Y{0)c(X)T 

{1-T)l*{x) 

-{l-T)l*{X)J _ 


K 

= n- \ 0 

k-H 






Y(0).c(A) 

K' 


0 0 

1 0 

1 -1 







Let q*{X) = e*{X)l*{X), 




/(T-e*{X))^c{X)c{X)'^ T{T-e*{X))Y{l)c{X) -(l-T)q*{X)Y{0)c{X) 


'(r) = E T{T-e*iX))Y{l)c{X) 


TY^{1) 
0 


L V-(l-r)g*(X)y(0)c(X) 0 {l-T)l*(X)^Y^(0) 

( :^c{X) -^c{X),Y(l) ^c(X),Y(0)~^<X),Y(0)\ 


c(X) 

K'^ - 

c(A),yn) 


H- 


Y(l) 


0 


\^i(X),Y(0) ^i(X),Y(0) ° Gy(0) 

$V^{IPW}$ can thus be computed in the same way and we omit the details.
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